Imagine you live in Toronto and have been offered a job in New York City. You really enjoy your neighborhood in Toronto and would like to move to a similar neighborhood in New York City. However, you don't know New York. Or perhaps you know you want to narrow your search to Manhattan, but where in Manhattan would you choose?
In my project as a Data Incubator fellow, I would like to address this question: given two cities, can we use data to find neighborhoods that are similar across them?
I will be using the following data sources:
Wikipedia: First, I scrape Wikipedia’s page for Toronto’s neighborhoods using python’s beautiful soup and requests packages to get the list of postcodes, boroughs and neighborhoods of Toronto. https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
NYU Spatial Data Repository: Next, I download the json file of New York’s neighborhoods from NYU’s spatial data repository website. https://cocl.us/new_york_dataset
Google maps platform: Then, I use Google’s Geocoding API and python’s geopy package to get the latitude and longitude of Toronto’s and New York’s neighborhoods. https://developers.google.com/maps/documentation/geocoding/start
Foursquare Places API: Next, I utilize the Foursquare Places API for gathering the businesses/venues information in the two cities. For each venue, I get the following: Name, category, latitude, longitude. https://developer.foursquare.com/places-api
NYC Open Data: Finally, I download New York House Sales Data (2003-2015) from NYC Open Data website. https://data.cityofnewyork.us/Housing-Development/NYC-Calendar-Sales-Archive-/uzf5-f8n2
After acquiring the data as mentioned above, I clean the data and transform it into the appropriate formats.
Basically, for each neighborhood in Toronto and Manhattan, I search for fifty venues in that neighborhood. (Fifty is the limit for this query on the Foursquare API.)
Next, I profile each neighborhood by creating a one-hot encoding of venue categories, grouping the records by neighborhood, and taking the mean. Now I have the mean frequency of occurrence of each venue category.
I can explore this further to get a sense of the distribution of categories in each neighborhood. So, I create a function that gives me the top ten venue categories in each neighborhood. Now I have some sense of what each neighborhood is about.
Next, I use the one-hot encoding to cluster the neighborhoods. I choose k-means clustering using python's scikit-learn library and try the elbow method to find the appropriate number of clusters. There wasn't a very sharp elbow, but k equal to eight looked best, so I cluster the neighborhoods with k = 8.
With the help of python’s folium library, I create interactive leaflet maps and visualize the clusters on the map for each city. Each circle on the map contains the neighborhood, cluster label and the borough.
Now, I can see which Manhattan neighborhoods are in the same cluster as my original Toronto neighborhood. These should be the most similar neighborhoods based on the venue data we have.
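A minimal sketch of that lookup, using a hypothetical merged dataframe shaped like the one built later in the notebook (the neighborhood names and cluster labels here are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the merged dataframe of neighborhoods and cluster labels.
merged = pd.DataFrame({
    'Borough': ['Downtown Toronto', 'Manhattan', 'Manhattan', 'Manhattan'],
    'Neighborhood': ['The Beaches', 'Chelsea', 'Soho', 'Inwood'],
    'Cluster Labels': [2, 2, 2, 0],
})

def similar_manhattan_neighborhoods(merged, toronto_neighborhood):
    # Look up the cluster of the chosen Toronto neighborhood...
    cluster = merged.loc[merged['Neighborhood'] == toronto_neighborhood,
                         'Cluster Labels'].iloc[0]
    # ...and return the Manhattan neighborhoods that share that cluster.
    mask = (merged['Borough'] == 'Manhattan') & (merged['Cluster Labels'] == cluster)
    return merged.loc[mask, 'Neighborhood'].tolist()

print(similar_manhattan_neighborhoods(merged, 'The Beaches'))
```

With the toy labels above, 'The Beaches' maps to Chelsea and Soho; the same function applied to the real merged dataframe returns the candidate list.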
Finally, I utilize Manhattan house sale prices. I group the sales by neighborhood, year and month, take their averages and create a new data frame.
Now, I use the monthly average neighborhood house sale price in two ways. First, I use the latest monthly average as a filtering criterion; I can decide that I'm interested only in neighborhoods whose average price falls within a certain range. This may eliminate some of the similar neighborhoods, or at least give the edge to some.
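A rough sketch of this grouping and filtering step, with hypothetical column names ('NEIGHBORHOOD', 'SALE DATE', 'SALE PRICE') and toy sales records standing in for the real NYC data:

```python
import pandas as pd

# Toy sales records; the real NYC file's column names may differ.
sales = pd.DataFrame({
    'NEIGHBORHOOD': ['CHELSEA', 'CHELSEA', 'INWOOD', 'INWOOD'],
    'SALE DATE': pd.to_datetime(['2015-11-03', '2015-12-20',
                                 '2015-12-01', '2015-12-15']),
    'SALE PRICE': [1200000, 1400000, 500000, 700000],
})
sales['year'] = sales['SALE DATE'].dt.year
sales['month'] = sales['SALE DATE'].dt.month

# Monthly average sale price per neighborhood.
monthly_avg = (sales.groupby(['NEIGHBORHOOD', 'year', 'month'])['SALE PRICE']
               .mean()
               .reset_index())

# Latest monthly average per neighborhood, used as the filtering criterion.
latest = (monthly_avg.sort_values(['year', 'month'])
          .groupby('NEIGHBORHOOD')
          .tail(1))
affordable = latest[latest['SALE PRICE'].between(400000, 1000000)]
print(affordable['NEIGHBORHOOD'].tolist())  # ['INWOOD']
```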
I also utilize python's Keras library to train an LSTM model to predict the future average sale price for each neighborhood. This could serve as another criterion for choosing the best-fit neighborhood, as it may point to a good investment opportunity, or flag a risky one, for the person who's moving to Manhattan.
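As a hedged sketch (the model code itself isn't shown here), the windowing step that frames a monthly price series as supervised inputs for an LSTM might look like this; `make_windows` and the toy prices are illustrative assumptions, not the original notebook's code:

```python
import numpy as np

def make_windows(series, lookback):
    # Frame a 1-D monthly price series as supervised samples:
    # each input is `lookback` consecutive values, the target is the next one.
    X, y = [], []
    for i in range(len(series) - lookback):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback])
    # LSTMs expect inputs shaped (samples, timesteps, features).
    return np.array(X).reshape(-1, lookback, 1), np.array(y)

prices = np.array([1.0, 1.1, 1.2, 1.3, 1.4])  # toy monthly averages
X, y = make_windows(prices, lookback=3)
print(X.shape, y)
```

A Keras Sequential model with one or two LSTM layers followed by a Dense(1) output, fit on X and y, would then produce the one-step-ahead price forecast.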
In summary, I was able to find similar neighborhoods in both cities through clustering, though this can be improved upon greatly in terms of the algorithm or the similarity measure. My experiment with the LSTM wasn't very successful; more time and tuning of the LSTM's structure and parameters are necessary to get a good model.
If I'm lucky enough to become a Data Incubator fellow, I would like to improve this project during the eight-week fellowship, time allowing. The following are some of my ideas:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import geocoder
from geopy.geocoders import Nominatim
from geopy.geocoders import GoogleV3
import folium # map rendering library
import sys
import numpy as np
from sklearn.cluster import KMeans
from collections import Counter
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors
import seaborn as sns
from sklearn.decomposition import PCA
import glob
from pandas import ExcelWriter
from pandas import ExcelFile
from sklearn import metrics
# from sklearn.preprocessing import MinMaxScaler
# from keras.models import Sequential
# from keras.layers import Dense
# from keras.layers import LSTM
# from keras.layers import Dropout
First, I'm going to scrape the Wikipedia page below to obtain Toronto's neighborhoods and transform the data into a pandas data frame.
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
# using requests and bs4 to scrape Wikipedia table:
url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M' #url of Toronto's neighborhoods information
r = requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')
# type(soup)
# print(soup.prettify())
data = []
table = soup.find('table', class_='sortable')
ths = table.find_all('th')
headings = [th.text.strip() for th in ths]
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
tor_data = pd.DataFrame(data=data, columns=headings)
tor_data.to_csv('../data/toronto.csv')
print(tor_data.shape)
tor_data.head()
tor_data = tor_data.drop(['Postcode'], axis=1)
tor_data = tor_data.dropna()
tor_data =tor_data[tor_data['Borough'] != 'Not assigned']
#keep only boroughs with "Toronto" in their name
tor_data = tor_data[tor_data.Borough.str.contains('Toronto', regex=True)].reset_index(drop=True)
tor_data= tor_data.rename(columns = {'Neighbourhood':'Neighborhood'})
tor_data.shape
Getting the latitude and longitude of neighborhoods
I use python's Geocoder package to retrieve the latitude and longitude of Toronto's neighborhoods.
latitude=[]
longitude=[]
# initialize your variable to None
lat_lng_coords = None
for Neighborhood in tor_data['Neighborhood']:
    # # loop until you get the coordinates
    # while(lat_lng_coords is None):
    try:
        g = geocoder.google('{}, Toronto, Ontario'.format(Neighborhood))
        lat_lng_coords = g.latlng
        latitude.append(lat_lng_coords[0])
        longitude.append(lat_lng_coords[1])
    except:
        latitude.append(None)
        longitude.append(None)
latitude[0:5]
The geocoder package didn't work; we keep getting None values. I'll try the geopy package instead.
geolocator = Nominatim(user_agent="HS")
latitude=[]
longitude=[]
for Neighborhood in tor_data['Neighborhood']:
    # print(Neighborhood)
    try:
        location = geolocator.geocode('{}, Toronto, Ontario'.format(Neighborhood))
        latitude.append(location.latitude)
        longitude.append(location.longitude)
    except:
        latitude.append('Not Found')
        longitude.append('Not Found')
longitude[0:10]
This didn't work either, so I'm going to sign up for the Google Geocoding API, create a key, and use it with geopy.
GeoPy’s documentation
Get Started with Google Maps Platform
Developer Guide: What is Geocoding?
I will place the Google API key in a text file and read it from there so it won't get exposed.
def getNeighborhoodLatLong(city_data, address, city_name):
    f = open('../API info/google_api_key.txt')
    google_api_key = f.read()
    geolocator = GoogleV3(api_key=google_api_key)
    latitude = []
    longitude = []
    for Neighborhood in city_data['Neighborhood']:
        try:
            location = geolocator.geocode('{}, {}'.format(Neighborhood, address))
            latitude.append(location.latitude)
            longitude.append(location.longitude)
        except:
            latitude.append('Not Found')
            longitude.append('Not Found')
    city_data['Latitude'] = latitude
    city_data['Longitude'] = longitude
    city_data = city_data[city_data['Latitude'] != 'Not Found']
    city_data.to_csv('../data/{}_processed.csv'.format(city_name))
    print(city_data.shape)
    return city_data.head()
getNeighborhoodLatLong(city_data=tor_data, address= 'Toronto, Ontario', city_name= 'toronto')
Let's first define a function to get the latitude and longitude of Toronto. I use Nominatim for this.
def getCoordinate(address):
    geolocator = Nominatim(user_agent="HS")
    location = geolocator.geocode(address)
    latitude = location.latitude
    longitude = location.longitude
    # print('The geographical coordinates of {} are {}, {}.'.format(address, latitude, longitude))
    return latitude, longitude
getCoordinate('Toronto, Ontario')
# create map of the city using latitude and longitude values
def makeMap(address, city_data, zoom_level):
    latitude, longitude = getCoordinate(address)
    m = folium.Map(location=[latitude, longitude], zoom_start=zoom_level)
    # add markers to map
    for lat, lng, borough, neighborhood in zip(city_data['Latitude'], city_data['Longitude'],
                                               city_data['Borough'], city_data['Neighborhood']):
        label = '{}- {}'.format(neighborhood, borough)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
            [lat, lng],
            radius=5,
            popup=label,
            color='blue',
            fill=True,
            fill_color='#3186cc',
            fill_opacity=0.7).add_to(m)
    return m
makeMap(address='Toronto, Ontario' , city_data= tor_data, zoom_level=12)
Now I am going to start utilizing the Foursquare API to explore the neighborhoods and segment them.
First, I create an account on the Foursquare developer website: https://developer.foursquare.com/
There are two types of free account with different rate limits that I can use:
To upgrade to personal account, I just need to provide my credit card information for verification purposes.
The Foursquare API documentation (https://developer.foursquare.com/docs) will tell us about the regular and premium calls, and how to use the API.
More information about rate limits also can be found in the documentation:
https://developer.foursquare.com/docs/api/troubleshooting/rate-limits
The information that I need for this project can be found under Places API.
Next, I "create a new app". This is straightforward in the developer portal. After doing this, I have my app's client ID and client secret.
Now, I create a function to get the data from Foursquare, clean the JSON and structure it into a pandas dataframe.
I will place my CLIENT_ID and CLIENT_SECRET in text files and read them from there so they are not revealed to the public.
def getNearbyVenues(city, city_data, radius, LIMIT):
    # load the Foursquare API credentials from the files
    CLIENT_ID = open('../API info/foursquare_client_id.txt').read()
    CLIENT_SECRET = open('../API info/foursquare_client_secret.txt').read()
    VERSION = '20180605'  # Foursquare API version
    venues_list = []
    for borough, neighborhood, lat, lng in zip(city_data['Borough'], city_data['Neighborhood'],
                                               city_data['Latitude'], city_data['Longitude']):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
        try:
            # make the GET request
            results = requests.get(url).json()["response"]['groups'][0]['items']
            # keep only the relevant information for each nearby venue
            venues_list.append([(
                borough,
                neighborhood,
                lat,
                lng,
                v['venue']['name'],
                v['venue']['location']['lat'],
                v['venue']['location']['lng'],
                v['venue']['categories'][0]['name']) for v in results])
        except:
            # append a single placeholder row so the flattening below still works
            venues_list.append([(borough, neighborhood, lat, lng, None, None, None, None)])
    try:
        nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
        nearby_venues.columns = ['Borough',
                                 'Neighborhood',
                                 'Neighborhood Latitude',
                                 'Neighborhood Longitude',
                                 'Venue',
                                 'Venue Latitude',
                                 'Venue Longitude',
                                 'Venue Category']
        nearby_venues.to_csv('../data/%s_venues.csv' % city)
        return nearby_venues
    except:
        print("Oops!", sys.exc_info()[0], "occurred.")
        print('You still need to put the output into a data frame. \nUncomment the cell below and run it.')
        return venues_list
tor_venues = getNearbyVenues(city='toronto',
city_data= tor_data,
radius=500, # measured the radius of some of Toronto's neighborhoods on google map.
LIMIT=50
)
Let's take a look at the dataframe.
print(tor_venues.shape)
tor_venues.head()
Let's check how many venues were returned for each neighborhood.
tor_venues.groupby('Neighborhood').count()
Let's find out how many unique categories can be curated from all the returned venues.
print('There are {} unique categories.'.format(len(tor_venues['Venue Category'].unique())))
Next, I download New York's neighborhood data and transform it into a pandas dataframe.
url = 'https://cocl.us/new_york_dataset'
r = requests.get(url)
# r
ny_json = r.json()
# ny_json
ny_json_neighborhoods = ny_json['features']
Let's take a look at the first item in this list.
ny_json_neighborhoods[0]
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']
# instantiate the dataframe
ny_data = pd.DataFrame(columns=column_names)
for data in ny_json_neighborhoods:
    borough = data['properties']['borough']
    neighborhood = data['properties']['name']
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    ny_data = ny_data.append({'Borough': borough,
                              'Neighborhood': neighborhood,
                              'Latitude': neighborhood_lat,
                              'Longitude': neighborhood_lon}, ignore_index=True)
ny_data.to_csv('../data/new york.csv')
print(ny_data.shape)
ny_data.head()
print('The dataframe has {} boroughs and {} neighborhoods and {} records.'.format(
len(ny_data['Borough'].unique()),
len(ny_data['Neighborhood'].unique()),
ny_data.shape[0]
)
)
Let's create the map of New York with the neighborhoods.
makeMap(address='New York City, NY' ,city_data= ny_data, zoom_level=11)
ny_venues = getNearbyVenues(city='new york',
city_data= ny_data,
radius=1000, # measured the radius of some of New York's neighborhoods on google map.
LIMIT=50
)
Let's take a look at the dataframe.
print(ny_venues.shape)
ny_venues.head()
Let's check how many venues were returned for each neighborhood.
ny_venues.groupby('Neighborhood').count()
print('There are {} unique categories.'.format(len(ny_venues['Venue Category'].unique())))
ny_data= pd.read_csv('../data/new york.csv')
ny_venues = pd.read_csv('../data/new york_venues.csv')
mn_data = ny_data[ny_data['Borough'] == 'Manhattan']
mn_data.to_csv('../data/manhattan.csv')
mn_venues = ny_venues[ny_venues['Borough'] == 'Manhattan']
mn_venues.to_csv('../data/manhattan_venues.csv')
print(mn_data.shape)
mn_data.head()
print(mn_venues.shape)
mn_venues.head()
print('There are {} unique categories.'.format(len(mn_venues['Venue Category'].unique())))
mn_venues.groupby('Neighborhood').count()
br_data = ny_data[ny_data['Borough'] == 'Brooklyn']
br_data.to_csv('../data/brooklyn.csv')
br_venues = ny_venues[ny_venues['Borough'] == 'Brooklyn']
br_venues.to_csv('../data/brooklyn_venues.csv')
print(br_data.shape)
br_data.head()
print(br_venues.shape)
br_venues.head()
print('There are {} unique categories.'.format(len(br_venues['Venue Category'].unique())))
br_venues.groupby('Neighborhood').count()
tor_venues = pd.read_csv('../data/toronto_venues.csv')
mn_venues = pd.read_csv('../data/manhattan_venues.csv')
Combining the data frames for data analysis.
tor_mn_venues = pd.concat([tor_venues, mn_venues], axis=0, ignore_index=True)
#one of the venue categories is "Neighborhood", changed it to prevent conflict later.
tor_mn_venues['Venue Category'].replace('Neighborhood','Neighborhood_venu', inplace=True)
print(tor_mn_venues.shape)
tor_mn_venues.head()
Compare Manhattan's 277 unique categories with Toronto's 268, and see from the combined count how much the two sets differ.
print(len(tor_venues['Venue Category'].unique()))
print(len(mn_venues['Venue Category'].unique()))
print(len(tor_mn_venues['Venue Category'].unique()))
tor_mn_venues_one_hot = pd.get_dummies(tor_mn_venues['Venue Category'])
print(tor_mn_venues_one_hot.shape)
tor_mn_venues_one_hot.head()
# add columns to the new dataframe
tor_mn_venues_one_hot[['Neighborhood']] = tor_mn_venues[['Neighborhood']]
# move borough... columns to the beginning
new_cols_order = list(tor_mn_venues_one_hot.columns[-1:]) + list(tor_mn_venues_one_hot.columns[:-1])
tor_mn_venues_one_hot = tor_mn_venues_one_hot[new_cols_order]
# or
# tor_mn_venues_one_hot.insert(0, 'Borough', tor_mn_venues['Borough'])
# tor_mn_venues_one_hot.insert(1, 'Neighborhood', tor_mn_venues['Neighborhood'])
print(tor_mn_venues_one_hot.shape)
tor_mn_venues_one_hot.head()
Let's group the rows by neighborhood, taking the mean of the frequency of occurrence of each category.
tor_mn_venues_grouped = tor_mn_venues_one_hot.groupby('Neighborhood').mean().reset_index()
print(tor_mn_venues_grouped.shape)
tor_mn_venues_grouped.head()
Let's print each neighborhood along with the top 5 most common venues.
num_top_venues = 5
for hood in tor_mn_venues_grouped['Neighborhood'][0:5]:
    print("----" + hood + "----")
    temp = tor_mn_venues_grouped[tor_mn_venues_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
Now let's create the new dataframe and display the top 10 venues for each neighborhood.
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
tor_mn_venues_sorted = pd.DataFrame(columns=columns)
tor_mn_venues_sorted['Neighborhood'] = tor_mn_venues_grouped['Neighborhood']
for ind in np.arange(tor_mn_venues_grouped.shape[0]):
    tor_mn_venues_sorted.iloc[ind, 1:] = return_most_common_venues(tor_mn_venues_grouped.iloc[ind, :], num_top_venues)
tor_mn_venues_sorted.head()
Now, I'm going to run k-means to cluster the neighborhoods. I use the elbow method to find the best number of clusters.
# A loop will be used to plot the inertia for up to 14 KMeans clusters
ks = range(1, 15)
inertias = []
tor_mn_venues_grouped_clustering = tor_mn_venues_grouped.drop('Neighborhood', axis=1)
# tor_mn_venues_grouped_clustering = tor_mn_venues_grouped.drop('Borough', 1)
for k in ks:
    # Initialize the KMeans object using the current number of clusters (k)
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    # Fit the features to the KMeans object
    km.fit(tor_mn_venues_grouped_clustering)
    # Append the inertia for `km` to the list of inertias
    inertias.append(km.inertia_)
# Plot the results in a line plot
plt.plot(ks, inertias, marker='o')
There is an elbow around k=4. There are more sophisticated ways of picking the number of clusters, which I will utilize later.
# set number of clusters
kclusters = 5
tor_mn_venues_grouped_clustering = tor_mn_venues_grouped.drop('Neighborhood', axis=1)
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, init='k-means++', n_init=10, random_state=0).fit(tor_mn_venues_grouped_clustering)
# check cluster labels generated for each row in the dataframe
# kmeans.labels_[0:kclusters]
Counter(kmeans.labels_)
Let's create a new dataframe that includes the cluster as well as the top 10 venues for each borough.
# add clustering labels
tor_mn_venues_sorted.drop('Cluster Labels', axis=1,inplace=True)
tor_mn_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
tor_mn_venues_merged = tor_mn_venues
tor_mn_venues_merged = pd.merge(tor_mn_venues[['Borough','Neighborhood','Neighborhood Latitude', 'Neighborhood Longitude']].drop_duplicates(),
tor_mn_venues_sorted,
left_on='Neighborhood',
right_on='Neighborhood',
how='inner')
print(tor_mn_venues_sorted.shape)
print(tor_mn_venues.shape)
print(tor_mn_venues_merged.shape)
tor_mn_venues_merged.head() # check the last columns!
Finally, let's visualize the resulting clusters.
tor_mn_venues_merged_tor = tor_mn_venues_merged[tor_mn_venues_merged.Borough.str.contains('Toronto', regex=True)]
print(tor_mn_venues_merged_tor.shape)
tor_mn_venues_merged_tor.to_csv('../data/toronto_venues_clustered.csv')
tor_mn_venues_merged_tor['Cluster Labels'].value_counts()
tor_mn_venues_merged_mn = tor_mn_venues_merged[tor_mn_venues_merged['Borough'] == 'Manhattan']
print(tor_mn_venues_merged_mn.shape)
tor_mn_venues_merged_mn.columns
tor_mn_venues_merged_mn.to_csv('../data/manhattan_venues_clustered.csv')
tor_mn_venues_merged_mn['Cluster Labels'].value_counts()
def makeClusterMap(address, city_data, city, zoom_level):
    latitude, longitude = getCoordinate(address)
    m = folium.Map(location=[latitude, longitude], zoom_start=zoom_level, width=1000, height=500)
    # set color scheme for the clusters
    x = np.arange(kclusters)
    ys = [i + x + (i*x)**2 for i in range(kclusters)]
    colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
    rainbow = [colors.rgb2hex(i) for i in colors_array]
    # add markers to the map
    markers_colors = []
    for lat, lon, poi, cluster in zip(city_data['Neighborhood Latitude'], city_data['Neighborhood Longitude'],
                                      city_data['Neighborhood'], city_data['Cluster Labels']):
        label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
        folium.CircleMarker(
            [lat, lon],
            radius=5,
            popup=label,
            color=rainbow[cluster-1],
            fill=True,
            fill_color=rainbow[cluster-1],
            fill_opacity=0.7).add_to(m)
    m.save('../plots/%s_cluster_map.html' % city)
    return m
makeClusterMap(address='Manhattan, NY' ,city_data=tor_mn_venues_merged_mn , city= 'manhattan_toronto', zoom_level=11)
makeClusterMap(address='Toronto, Ontario' ,city_data= tor_mn_venues_merged_tor, city= 'toronto_manhattan', zoom_level=12)
Now, we can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, we can also assign a name to each cluster.
for cl in range(0, kclusters):
    tor_mn_venues_merged.loc[tor_mn_venues_merged['Cluster Labels'] == cl, tor_mn_venues_merged.columns[[1] + list(range(5, tor_mn_venues_merged.shape[1]))]].reset_index(drop=True).to_html('../tables/toronto_manhattan_cl_{}.html'.format(cl))
Cluster 0
tor_mn_venues_merged.loc[tor_mn_venues_merged['Cluster Labels'] == 0, tor_mn_venues_merged.columns[[1] + list(range(5, tor_mn_venues_merged.shape[1]))]]
Cluster 1
tor_mn_venues_merged.loc[tor_mn_venues_merged['Cluster Labels'] == 1, tor_mn_venues_merged.columns[[1] + list(range(5, tor_mn_venues_merged.shape[1]))]]
Cluster 2
tor_mn_venues_merged.loc[tor_mn_venues_merged['Cluster Labels'] == 2, tor_mn_venues_merged.columns[[1] + list(range(5, tor_mn_venues_merged.shape[1]))]]
Cluster 3
tor_mn_venues_merged.loc[tor_mn_venues_merged['Cluster Labels'] == 3, tor_mn_venues_merged.columns[[1] + list(range(5, tor_mn_venues_merged.shape[1]))]]
Cluster 4
tor_mn_venues_merged.loc[tor_mn_venues_merged['Cluster Labels'] == 4, tor_mn_venues_merged.columns[[1] + list(range(5, tor_mn_venues_merged.shape[1]))]]
Cluster 5
tor_mn_venues_merged.loc[tor_mn_venues_merged['Cluster Labels'] == 5, tor_mn_venues_merged.columns[[1] + list(range(5, tor_mn_venues_merged.shape[1]))]]
Cluster 6
tor_mn_venues_merged.loc[tor_mn_venues_merged['Cluster Labels'] == 6, tor_mn_venues_merged.columns[[1] + list(range(5, tor_mn_venues_merged.shape[1]))]]
Cluster 7
tor_mn_venues_merged.loc[tor_mn_venues_merged['Cluster Labels'] == 7, tor_mn_venues_merged.columns[[1] + list(range(5, tor_mn_venues_merged.shape[1]))]]
metrics.silhouette_score(tor_mn_venues_grouped_clustering, kmeans.labels_, metric='euclidean')
from sklearn.cluster import AgglomerativeClustering
for linkage in ['ward', 'complete', 'average', 'single']:
    for n_clusters in range(2, 10):
        clustering = AgglomerativeClustering(linkage=linkage, n_clusters=n_clusters).fit(tor_mn_venues_grouped_clustering)
        print(linkage, n_clusters,
              metrics.silhouette_score(tor_mn_venues_grouped_clustering, clustering.labels_, metric='euclidean'))
Let's create a new dataframe that includes the cluster as well as the top 10 venues for each borough.
# add clustering labels
tor_mn_venues_sorted.drop('Cluster Labels', axis=1,inplace=True)
tor_mn_venues_sorted.insert(0, 'Cluster Labels', clustering.labels_)
tor_mn_venues_merged = tor_mn_venues
tor_mn_venues_merged = pd.merge(tor_mn_venues[['Borough','Neighborhood','Neighborhood Latitude', 'Neighborhood Longitude']].drop_duplicates(),
tor_mn_venues_sorted,
left_on='Neighborhood',
right_on='Neighborhood',
how='inner')
print(tor_mn_venues_sorted.shape)
print(tor_mn_venues.shape)
print(tor_mn_venues_merged.shape)
tor_mn_venues_merged.head() # check the last columns!
Finally, let's visualize the resulting clusters.
tor_mn_venues_merged_tor = tor_mn_venues_merged[tor_mn_venues_merged.Borough.str.contains('Toronto', regex=True)]
print(tor_mn_venues_merged_tor.shape)
tor_mn_venues_merged_tor.to_csv('../data/toronto_venues_clustered.csv')
tor_mn_venues_merged_tor['Cluster Labels'].value_counts()
tor_mn_venues_merged_mn = tor_mn_venues_merged[tor_mn_venues_merged['Borough'] == 'Manhattan']
print(tor_mn_venues_merged_mn.shape)
tor_mn_venues_merged_mn.columns
tor_mn_venues_merged_mn.to_csv('../data/manhattan_venues_clustered.csv')
tor_mn_venues_merged_mn['Cluster Labels'].value_counts()
makeClusterMap(address='Manhattan, NY' ,city_data=tor_mn_venues_merged_mn , city= 'manhattan_toronto', zoom_level=11)
makeClusterMap(address='Toronto, Ontario' ,city_data= tor_mn_venues_merged_tor, city= 'toronto_manhattan', zoom_level=12)
Combining the data frames for data analysis.
br_mn_venues = pd.concat([br_venues, mn_venues], axis=0, ignore_index=True)
#one of the venue categories is "Neighborhood", changed it to prevent conflict later.
br_mn_venues['Venue Category'].replace('Neighborhood','Neighborhood_venu', inplace=True)
print(br_mn_venues.shape)
br_mn_venues.head()
Compare the unique category counts of Brooklyn and Manhattan, and see from the combined count how much the two sets differ.
print(len(br_venues['Venue Category'].unique()))
print(len(mn_venues['Venue Category'].unique()))
print(len(br_mn_venues['Venue Category'].unique()))
br_mn_venues_one_hot = pd.get_dummies(br_mn_venues['Venue Category'])
print(br_mn_venues_one_hot.shape)
br_mn_venues_one_hot.head()
# add columns to the new dataframe
br_mn_venues_one_hot[['Neighborhood']] = br_mn_venues[['Neighborhood']]
# move borough... columns to the beginning
new_cols_order = list(br_mn_venues_one_hot.columns[-1:]) + list(br_mn_venues_one_hot.columns[:-1])
br_mn_venues_one_hot = br_mn_venues_one_hot[new_cols_order]
# or
# tor_mn_venues_one_hot.insert(0, 'Borough', tor_mn_venues['Borough'])
# tor_mn_venues_one_hot.insert(1, 'Neighborhood', tor_mn_venues['Neighborhood'])
print(br_mn_venues_one_hot.shape)
br_mn_venues_one_hot.head()
Let's group the rows by neighborhood, taking the mean of the frequency of occurrence of each category.
br_mn_venues_grouped = br_mn_venues_one_hot.groupby('Neighborhood').mean().reset_index()
print(br_mn_venues_grouped.shape)
br_mn_venues_grouped.head()
Let's print each neighborhood along with the top 5 most common venues.
num_top_venues = 5
for hood in br_mn_venues_grouped['Neighborhood'][0:5]:
    print("----" + hood + "----")
    temp = br_mn_venues_grouped[br_mn_venues_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
Now let's create the new dataframe and display the top 10 venues for each neighborhood.
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
br_mn_venues_sorted = pd.DataFrame(columns=columns)
br_mn_venues_sorted['Neighborhood'] = br_mn_venues_grouped['Neighborhood']
for ind in np.arange(br_mn_venues_grouped.shape[0]):
    br_mn_venues_sorted.iloc[ind, 1:] = return_most_common_venues(br_mn_venues_grouped.iloc[ind, :], num_top_venues)
br_mn_venues_sorted.head()
Now, I'm going to run k-means to cluster the neighborhoods. I use the elbow method to find the best number of clusters.
# A loop will be used to plot the inertia for up to 14 KMeans clusters
ks = range(1, 15)
inertias = []
br_mn_venues_grouped_clustering = br_mn_venues_grouped.drop('Neighborhood', axis=1)
# br_mn_venues_grouped_clustering = br_mn_venues_grouped.drop('Borough', 1)
for k in ks:
    # Initialize the KMeans object using the current number of clusters (k)
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    # Fit the features to the KMeans object
    km.fit(br_mn_venues_grouped_clustering)
    # Append the inertia for `km` to the list of inertias
    inertias.append(km.inertia_)
# Plot the results in a line plot
plt.plot(ks, inertias, marker='o')
There isn't a clear elbow. For now I pick k=5. There are more sophisticated ways of picking the number of clusters, which I will utilize later.
# set number of clusters
kclusters = 5
# br_mn_venues_grouped_clustering = br_mn_venues_grouped.drop('Neighborhood', 1)
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, init='k-means++', n_init=10, random_state=0).fit(br_mn_venues_grouped_clustering)
# check cluster labels generated for each row in the dataframe
# kmeans.labels_[0:kclusters]
Counter(kmeans.labels_)
Let's create a new dataframe that includes the cluster as well as the top 10 venues for each borough.
# add clustering labels
# br_mn_venues_sorted.drop('Cluster Labels', axis=1,inplace=True)
br_mn_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
br_mn_venues_merged = br_mn_venues
br_mn_venues_merged = pd.merge(br_mn_venues[['Borough','Neighborhood','Neighborhood Latitude', 'Neighborhood Longitude']].drop_duplicates(),
br_mn_venues_sorted,
left_on='Neighborhood',
right_on='Neighborhood',
how='inner')
print(br_mn_venues_sorted.shape)
print(br_mn_venues.shape)
print(br_mn_venues_merged.shape)
br_mn_venues_merged.head() # check the last columns!
Finally, let's visualize the resulting clusters.
br_mn_venues_merged_mn = br_mn_venues_merged[br_mn_venues_merged['Borough'] == 'Manhattan']
print(br_mn_venues_merged_mn.shape)
br_mn_venues_merged_mn.columns
br_mn_venues_merged_mn.to_csv('../data/manhattan_brooklyn_venues_clustered.csv')
br_mn_venues_merged_mn['Cluster Labels'].value_counts()
br_mn_venues_merged_br = br_mn_venues_merged[br_mn_venues_merged['Borough'] == 'Brooklyn']
print(br_mn_venues_merged_br.shape)
br_mn_venues_merged_br.columns
br_mn_venues_merged_br.to_csv('../data/brooklyn_manhattan_venues_clustered.csv')
br_mn_venues_merged_br['Cluster Labels'].value_counts()
makeClusterMap(address='Brooklyn, NY' ,city_data=br_mn_venues_merged_br , city= 'brooklyn_manhattan', zoom_level=11)
makeClusterMap(address='Manhattan, NY' ,city_data= br_mn_venues_merged_mn, city= 'manhattan_brooklyn', zoom_level=11)
First, I'm going to save the dataframes into html files for later use.
for cl in range(0, kclusters):
    br_mn_venues_merged.loc[br_mn_venues_merged['Cluster Labels'] == cl, br_mn_venues_merged.columns[[1] + list(range(5, br_mn_venues_merged.shape[1]))]].reset_index(drop=True).to_html('../tables/brooklyn_manhattan_cl_{}.html'.format(cl))
Now, we can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, we can also assign a name to each cluster.
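Besides reading the tables, the discriminating categories can also be read off the fitted k-means centroids: the largest coordinates of each centroid are the categories with the highest mean frequency in that cluster. A hedged sketch on a toy venue-frequency frame (the category names and the tiny matrix are made up, not the notebook's data):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy venue-frequency matrix: rows are neighborhoods, columns are categories.
freq = pd.DataFrame(
    [[0.8, 0.1, 0.1],
     [0.7, 0.2, 0.1],
     [0.1, 0.1, 0.8],
     [0.2, 0.0, 0.8]],
    columns=['Coffee Shop', 'Park', 'Bar'],
)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(freq)
centroids = pd.DataFrame(km.cluster_centers_, columns=freq.columns)

# The defining category of each cluster = the centroid coordinate with the
# largest mean frequency.
top = centroids.idxmax(axis=1)
print(sorted(top))  # expected: ['Bar', 'Coffee Shop']
```

Taking the top few `idxmax`-style coordinates per centroid is a quick way to name each cluster.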
Cluster 0
br_mn_venues_merged.loc[br_mn_venues_merged['Cluster Labels'] == 0, br_mn_venues_merged.columns[[1] + list(range(5, br_mn_venues_merged.shape[1]))]]
Cluster 1
br_mn_venues_merged.loc[br_mn_venues_merged['Cluster Labels'] == 1, br_mn_venues_merged.columns[[1] + list(range(5, br_mn_venues_merged.shape[1]))]]
Cluster 2
br_mn_venues_merged.loc[br_mn_venues_merged['Cluster Labels'] == 2, br_mn_venues_merged.columns[[1] + list(range(5, br_mn_venues_merged.shape[1]))]]
Cluster 3
br_mn_venues_merged.loc[br_mn_venues_merged['Cluster Labels'] == 3, br_mn_venues_merged.columns[[1] + list(range(5, br_mn_venues_merged.shape[1]))]]
Cluster 4
br_mn_venues_merged.loc[br_mn_venues_merged['Cluster Labels'] == 4, br_mn_venues_merged.columns[[1] + list(range(5, br_mn_venues_merged.shape[1]))]]
metrics.silhouette_score(br_mn_venues_grouped_clustering, kmeans.labels_, metric='euclidean')
# for linkage in ['ward', 'complete', 'average', 'single']:
# for n_clusters in range(2, 10):
# clustering(data=br_mn_venues_grouped_clustering, linkage=linkage, n_clusters=n_clusters)
# metrics.silhouette_score(br_mn_venues_grouped_clustering, clustering.labels_, metric='euclidean')
Let's create a new dataframe that includes the cluster label as well as the top 10 venues for each neighborhood.
# # add clustering labels
# br_mn_venues_sorted.drop('Cluster Labels', axis=1,inplace=True)
# br_mn_venues_sorted.insert(0, 'Cluster Labels', clustering.labels_)
# br_mn_venues_merged = br_mn_venues
# br_mn_venues_merged = pd.merge(br_mn_venues[['Borough','Neighborhood','Neighborhood Latitude', 'Neighborhood Longitude']].drop_duplicates(),
# br_mn_venues_sorted,
# left_on='Neighborhood',
# right_on='Neighborhood',
# how='inner')
# print(br_mn_venues_sorted.shape)
# print(br_mn_venues.shape)
# print(br_mn_venues_merged.shape)
# br_mn_venues_merged.head() # check the last columns!
mn_venues_one_hot = pd.get_dummies(mn_venues['Venue Category'])
print(mn_venues_one_hot.shape)
mn_venues_one_hot.head()
# add columns to the new dataframe
mn_venues_one_hot[['Neighborhood']] = mn_venues[['Neighborhood']]
# move the Neighborhood column to the beginning
new_cols_order = list(mn_venues_one_hot.columns[-1:]) + list(mn_venues_one_hot.columns[:-1])
mn_venues_one_hot = mn_venues_one_hot[new_cols_order]
# or
# mn_venues_one_hot.insert(0, 'Borough', mn_venues['Borough'])
# mn_venues_one_hot.insert(1, 'Neighborhood', mn_venues['Neighborhood'])
print(mn_venues_one_hot.shape)
mn_venues_one_hot.head()
Let's group rows by neighborhood and take the mean frequency of occurrence of each category.
mn_venues_grouped = mn_venues_one_hot.groupby('Neighborhood').mean().reset_index()
print(mn_venues_grouped.shape)
mn_venues_grouped.head()
Let's print each neighborhood along with the top 5 most common venues.
num_top_venues = 5
for hood in mn_venues_grouped['Neighborhood'][0:5]:
    print("----" + hood + "----")
    temp = mn_venues_grouped[mn_venues_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')
def return_most_common_venues(row, num_top_venues):
    # Sort a neighborhood's category frequencies and return the top category names
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
Now let's create the new dataframe and display the top 10 venues for each neighborhood.
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind + 1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind + 1))
# create a new dataframe
mn_venues_sorted = pd.DataFrame(columns=columns)
mn_venues_sorted['Neighborhood'] = mn_venues_grouped['Neighborhood']
for ind in np.arange(mn_venues_grouped.shape[0]):
    mn_venues_sorted.iloc[ind, 1:] = return_most_common_venues(mn_venues_grouped.iloc[ind, :], num_top_venues)
mn_venues_sorted.head()
Now, I'm going to run k-means to cluster the neighborhoods. I use the elbow method to find the best number of clusters.
# A loop will be used to plot the inertia for up to 14 KMeans clusters
ks = range(1, 15)
inertias = []
mn_venues_grouped_clustering = mn_venues_grouped.drop('Neighborhood', axis=1)
# mn_venues_grouped_clustering = mn_venues_grouped.drop('Borough', axis=1)
for k in ks:
    # Initialize the KMeans object using the current number of clusters (k)
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    # Fit the features to the KMeans object
    km.fit(mn_venues_grouped_clustering)
    # Append the inertia for `km` to the list of inertias
    inertias.append(km.inertia_)
# Plot the results in a line plot
plt.plot(ks, inertias, marker='o')
There isn't a clear elbow. I choose k=5 for now. There are more sophisticated ways of choosing the number of clusters, which I will utilize later.
# set number of clusters
kclusters = 5
mn_venues_grouped_clustering = mn_venues_grouped.drop('Neighborhood', axis=1)
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, init='k-means++', n_init=10, random_state=0).fit(mn_venues_grouped_clustering)
# check cluster labels generated for each row in the dataframe
# kmeans.labels_[0:kclusters]
Counter(kmeans.labels_)
Let's create a new dataframe that includes the cluster label as well as the top 10 venues for each neighborhood.
# add clustering labels
# mn_venues_sorted.drop('Cluster Labels', axis=1,inplace=True)
mn_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
mn_venues_merged = mn_venues
mn_venues_merged = pd.merge(mn_venues[['Borough','Neighborhood','Neighborhood Latitude', 'Neighborhood Longitude']].drop_duplicates(),
mn_venues_sorted,
left_on='Neighborhood',
right_on='Neighborhood',
how='inner')
print(mn_venues_sorted.shape)
print(mn_venues.shape)
print(mn_venues_merged.shape)
# mn_venues_merged.head() # check the last columns!
mn_venues_sorted.head(10)
Finally, let's visualize the resulting clusters.
makeClusterMap(address='Manhattan, NY' ,city_data=mn_venues_merged , city= 'manhattan', zoom_level=11)
Now, we can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, we can also assign a name to each cluster.
First, I'm going to save the dataframes into html files for later use.
for cl in range(0, kclusters):
    mn_venues_merged.loc[mn_venues_merged['Cluster Labels'] == cl, mn_venues_merged.columns[[1] + list(range(5, mn_venues_merged.shape[1]))]].reset_index(drop=True).to_html('../tables/manhattan_cl_{}.html'.format(cl))
Cluster 0
mn_cl_0 = mn_venues_merged.loc[mn_venues_merged['Cluster Labels'] == 0, mn_venues_merged.columns[[1] + list(range(5, mn_venues_merged.shape[1]))]].reset_index(drop=True)
mn_cl_0.to_html('../tables/m_cl_0.html')
mn_cl_0
Cluster 1
mn_venues_merged.loc[mn_venues_merged['Cluster Labels'] == 1, mn_venues_merged.columns[[1] + list(range(5, mn_venues_merged.shape[1]))]]
Cluster 2
mn_venues_merged.loc[mn_venues_merged['Cluster Labels'] == 2, mn_venues_merged.columns[[1] + list(range(5, mn_venues_merged.shape[1]))]]
Cluster 3
mn_venues_merged.loc[mn_venues_merged['Cluster Labels'] == 3, mn_venues_merged.columns[[1] + list(range(5, mn_venues_merged.shape[1]))]]
Cluster 4
mn_venues_merged.loc[mn_venues_merged['Cluster Labels'] == 4, mn_venues_merged.columns[[1] + list(range(5, mn_venues_merged.shape[1]))]]
The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the mean distance between a sample and the points of the nearest cluster that the sample is not a part of. Note that the Silhouette Coefficient is only defined when 2 <= n_labels <= n_samples - 1.
This function returns the mean Silhouette Coefficient over all samples. To obtain the values for each sample, use silhouette_samples.
The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.
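The formula above can be verified by hand on a toy example. This sketch computes (b - a) / max(a, b) for one sample of a tiny 1-D dataset (two obvious clusters, made-up values) and checks it against scikit-learn's `silhouette_samples`:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two obvious clusters on a line: {0, 1} and {10, 11}.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

# For the first sample:
a = 1.0                # mean intra-cluster distance (only other member is at 1.0)
b = (10.0 + 11.0) / 2  # mean distance to the nearest other cluster
manual = (b - a) / max(a, b)

# sklearn agrees with the hand computation
assert np.isclose(silhouette_samples(X, labels)[0], manual)
print(round(manual, 3))  # expected: 0.905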
metrics.silhouette_score(mn_venues_grouped_clustering, kmeans.labels_, metric='euclidean')
from sklearn.cluster import AgglomerativeClustering
def clustering(data, linkage, n_clusters):
    # Fit agglomerative clustering and report cluster sizes and silhouette score
    model = AgglomerativeClustering(affinity='euclidean',
                                    linkage=linkage,
                                    n_clusters=n_clusters).fit(data)
    print(linkage, n_clusters)
    print(Counter(model.labels_))
    print(metrics.silhouette_score(data, model.labels_, metric='euclidean'))

for linkage in ['ward', 'complete', 'average', 'single']:
    for n_clusters in range(2, 10):
        clustering(data=mn_venues_grouped_clustering, linkage=linkage, n_clusters=n_clusters)
from sklearn.cluster import DBSCAN
def dbscan_clustering(data):
    dbscan = DBSCAN(eps=1, min_samples=10).fit(data)
    print(Counter(dbscan.labels_))
    # print(metrics.silhouette_score(data, dbscan.labels_, metric='euclidean'))
dbscan_clustering(data=mn_venues_grouped_clustering)
# mn_venues_grouped_clustering = mn_venues_grouped.drop('Neighborhood', 1)
print(mn_venues_grouped.shape)
mn_venues_grouped.head()
# f, ax = plt.subplots(figsize=(45,40))
plt.figure(figsize=(45, 40))
plt.title('Pearson Correlation of features')
heatmap_plot = sns.heatmap(mn_venues_grouped.drop('Neighborhood', axis=1).corr(),
square=True,
cmap=sns.diverging_palette(20, 220, n=200),
linecolor='black'
)
figure = heatmap_plot.get_figure()
figure.savefig('../plots/heatmap_plot.png', dpi=600)
# f, ax = plt.subplots(figsize=(45,40))
plt.figure(figsize=(45, 40))
plt.title('Pearson Correlation of features')
corr = mn_venues_grouped.drop('Neighborhood', axis=1).corr()
heatmap_plot = sns.heatmap(corr[(corr >= 0.5) | (corr <= -0.4)],
square=True,
cmap=sns.diverging_palette(20, 220, n=200),
linecolor='black'
)
figure = heatmap_plot.get_figure()
# figure.savefig('../plots/heatmap_plot.png', dpi=600)
# Make an instance of the Model
# pca = PCA(n_components=20)
pca = PCA(0.99)
pca.fit(mn_venues_grouped.drop('Neighborhood', axis=1))
pca.n_components_
# pca.explained_variance_ratio_
data_tr = pca.transform(mn_venues_grouped.drop('Neighborhood', axis=1))
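As a side note, passing a float in (0, 1) to `PCA` keeps the smallest number of components whose cumulative explained variance reaches that fraction. A minimal sketch on synthetic data (the array `X` here is illustrative, not the venue matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 0] *= 100  # make one direction carry almost all the variance

# PCA(0.99) keeps just enough components to explain 99% of the variance
pca = PCA(0.99).fit(X)
print(pca.n_components_)  # expected: 1, since one direction dominates
```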
# A loop will be used to plot the inertia for up to 14 KMeans clusters
ks = range(1, 15)
inertias = []
# mn_venues_grouped_clustering = mn_venues_grouped.drop('Neighborhood', axis=1)
# mn_venues_grouped_clustering = mn_venues_grouped.drop('Borough', axis=1)
for k in ks:
    # Initialize the KMeans object using the current number of clusters (k)
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    # Fit the PCA-transformed features to the KMeans object
    km.fit(data_tr)
    # Append the inertia for `km` to the list of inertias
    inertias.append(km.inertia_)
# Plot the results in a line plot
plt.plot(ks, inertias, marker='o')
There isn't a clear elbow. I choose k=5 for now. There are more sophisticated ways of choosing the number of clusters, which I will utilize later.
# set number of clusters
kclusters = 5
# mn_venues_grouped_clustering = mn_venues_grouped.drop('Neighborhood', 1)
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, init='k-means++', n_init=10, random_state=0).fit(data_tr)
# check cluster labels generated for each row in the dataframe
# kmeans.labels_[0:kclusters]
Counter(kmeans.labels_)
Let's create a new dataframe that includes the cluster label as well as the top 10 venues for each neighborhood.
df_data_tr = pd.DataFrame(data_tr)
# df_data_tr.insert(0, 'Cluster Labels', kmeans.labels_)
df_data_tr.head()
Finally, let's visualize the resulting clusters.
# add clustering labels (drop the previous ones first)
mn_venues_sorted.drop('Cluster Labels', axis=1, inplace=True)
mn_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
mn_venues_merged = mn_venues
mn_venues_merged = pd.merge(mn_venues[['Borough','Neighborhood','Neighborhood Latitude', 'Neighborhood Longitude']].drop_duplicates(),
mn_venues_sorted,
left_on='Neighborhood',
right_on='Neighborhood',
how='inner')
print(mn_venues_sorted.shape)
print(mn_venues.shape)
print(mn_venues_merged.shape)
# mn_venues_merged.head() # check the last columns!
mn_venues_sorted.head()
makeClusterMap(address='Manhattan, NY' ,city_data= mn_venues_merged , city= 'pca_manhattan', zoom_level=11)
br_venues_one_hot = pd.get_dummies(br_venues['Venue Category'])
print(br_venues_one_hot.shape)
br_venues_one_hot.head()
# add columns to the new dataframe
br_venues_one_hot[['Neighborhood']] = br_venues[['Neighborhood']]
# move the Neighborhood column to the beginning
new_cols_order = list(br_venues_one_hot.columns[-1:]) + list(br_venues_one_hot.columns[:-1])
br_venues_one_hot = br_venues_one_hot[new_cols_order]
# or
# br_venues_one_hot.insert(0, 'Borough', br_venues['Borough'])
# br_venues_one_hot.insert(1, 'Neighborhood', br_venues['Neighborhood'])
print(br_venues_one_hot.shape)
br_venues_one_hot.head()
Let's group rows by neighborhood and take the mean frequency of occurrence of each category.
br_venues_grouped = br_venues_one_hot.groupby('Neighborhood').mean().reset_index()
print(br_venues_grouped.shape)
br_venues_grouped.head()
Let's print each neighborhood along with the top 5 most common venues.
num_top_venues = 5
for hood in br_venues_grouped['Neighborhood'][0:5]:
    print("----" + hood + "----")
    temp = br_venues_grouped[br_venues_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue', 'freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')
def return_most_common_venues(row, num_top_venues):
    # Sort a neighborhood's category frequencies and return the top category names
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
Now let's create the new dataframe and display the top 10 venues for each neighborhood.
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind + 1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind + 1))
# create a new dataframe
br_venues_sorted = pd.DataFrame(columns=columns)
br_venues_sorted['Neighborhood'] = br_venues_grouped['Neighborhood']
for ind in np.arange(br_venues_grouped.shape[0]):
    br_venues_sorted.iloc[ind, 1:] = return_most_common_venues(br_venues_grouped.iloc[ind, :], num_top_venues)
br_venues_sorted.head()
Now, I'm going to run k-means to cluster the neighborhoods. I use the elbow method to find the best number of clusters.
# A loop will be used to plot the inertia for up to 14 KMeans clusters
ks = range(1, 15)
inertias = []
br_venues_grouped_clustering = br_venues_grouped.drop('Neighborhood', axis=1)
# br_venues_grouped_clustering = br_venues_grouped.drop('Borough', axis=1)
for k in ks:
    # Initialize the KMeans object using the current number of clusters (k)
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0)
    # Fit the features to the KMeans object
    km.fit(br_venues_grouped_clustering)
    # Append the inertia for `km` to the list of inertias
    inertias.append(km.inertia_)
# Plot the results in a line plot
plt.plot(ks, inertias, marker='o')
There isn't a clear elbow. I choose k=5 for now. There are more sophisticated ways of choosing the number of clusters, which I will utilize later.
# set number of clusters
kclusters = 5
br_venues_grouped_clustering = br_venues_grouped.drop('Neighborhood', axis=1)
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, init='k-means++', n_init=10, random_state=0).fit(br_venues_grouped_clustering)
# check cluster labels generated for each row in the dataframe
# kmeans.labels_[0:kclusters]
Counter(kmeans.labels_)
Let's create a new dataframe that includes the cluster label as well as the top 10 venues for each neighborhood.
# add clustering labels
# br_venues_sorted.drop('Cluster Labels', axis=1,inplace=True)
br_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
br_venues_merged = br_venues
br_venues_merged = pd.merge(br_venues[['Borough','Neighborhood','Neighborhood Latitude', 'Neighborhood Longitude']].drop_duplicates(),
br_venues_sorted,
left_on='Neighborhood',
right_on='Neighborhood',
how='inner')
print(br_venues_sorted.shape)
print(br_venues.shape)
print(br_venues_merged.shape)
br_venues_merged.head() # check the last columns!
Finally, let's visualize the resulting clusters.
makeClusterMap(address='Brooklyn, NY' ,city_data=br_venues_merged , city= 'brooklyn', zoom_level=11)
Now, we can examine each cluster and determine the discriminating venue categories that distinguish each cluster. Based on the defining categories, we can also assign a name to each cluster.
for cl in range(0, kclusters):
    br_venues_merged.loc[br_venues_merged['Cluster Labels'] == cl, br_venues_merged.columns[[1] + list(range(5, br_venues_merged.shape[1]))]].reset_index(drop=True).to_html('../tables/brooklyn_cl_{}.html'.format(cl))
Cluster 0
br_venues_merged.loc[br_venues_merged['Cluster Labels'] == 0, br_venues_merged.columns[[1] + list(range(5, br_venues_merged.shape[1]))]]
Cluster 1
br_venues_merged.loc[br_venues_merged['Cluster Labels'] == 1, br_venues_merged.columns[[1] + list(range(5, br_venues_merged.shape[1]))]]
Cluster 2
br_venues_merged.loc[br_venues_merged['Cluster Labels'] == 2, br_venues_merged.columns[[1] + list(range(5, br_venues_merged.shape[1]))]]
Cluster 3
br_venues_merged.loc[br_venues_merged['Cluster Labels'] == 3, br_venues_merged.columns[[1] + list(range(5, br_venues_merged.shape[1]))]]
Cluster 4
br_venues_merged.loc[br_venues_merged['Cluster Labels'] == 4, br_venues_merged.columns[[1] + list(range(5, br_venues_merged.shape[1]))]]
metrics.silhouette_score(br_venues_grouped_clustering, kmeans.labels_, metric='euclidean')
for linkage in ['ward', 'complete', 'average', 'single']:
    for n_clusters in range(2, 10):
        clustering(data=br_venues_grouped_clustering, linkage=linkage, n_clusters=n_clusters)
Now I add the housing sales data from https://data.cityofnewyork.us/Housing-Development/NYC-Calendar-Sales-Archive-/uzf5-f8n2
all_files = glob.glob("../data/housing/Manhattan Housing Sales Data/*.xls")
mn_sales = pd.DataFrame()
for f in all_files:
    mn_sale = pd.read_excel(f, index_col=None, header=None, skiprows=5)
    mn_sales = pd.concat([mn_sales, mn_sale])
columns = pd.read_excel(glob.glob("../data/housing/Manhattan Housing Sales Data/2015*.xls")[0],header=None).iloc[4].str.rstrip()
mn_sales.columns = columns
mn_sales.to_csv('../data/manhattan_sales.csv')
mn_sales = pd.read_csv('../data/manhattan_sales.csv')
mn_sales.drop(columns=['Unnamed: 0'], inplace=True)
# mn_sales['SALE DATE'] = datetime.strptime(mn_sales['SALE DATE'], '20%y-%m-%d')
mn_sales['SALE DATE'] = mn_sales['SALE DATE'].astype('datetime64[ns]')
mn_sales['year'] = pd.DatetimeIndex(mn_sales['SALE DATE']).year
mn_sales['month'] = pd.DatetimeIndex(mn_sales['SALE DATE']).month
mn_sales['NEIGHBORHOOD'] = mn_sales['NEIGHBORHOOD'].str.strip()
pd.DataFrame(mn_sales.columns, columns=['columns'])
mn_sales = mn_sales.iloc[:, [1,16,19,21,22]]
print(mn_sales.shape)
mn_sales.to_csv('../data/manhattan_sales_flask.csv')
mn_sales.head()
mn_sales['SALE PRICE'].describe()
Wow, there's a sale price of more than 4 billion dollars. Out of curiosity, let's look at that sale.
mn_sales.loc[mn_sales['SALE PRICE'].idxmax()]
Looks like this is a giant block building.
Let's see what are the other highest sales values.
mn_sales['SALE PRICE'].sort_values(ascending=False).head(10)
Let's get the 95th percentile value.
mn_sales['SALE PRICE'].quantile(0.95)
mn_sales.groupby(['year','month'])['SALE PRICE'].agg(['count','mean','median']).reset_index().head(10)
We can see there is a big difference between mean and median.
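A toy example shows how this gap arises: a handful of extreme sales pulls the mean far above the median. (The numbers below are made up, not the actual NYC data.)

```python
import numpy as np

# Four ordinary sales plus one extreme outlier
prices = np.array([300_000, 400_000, 500_000, 600_000, 4_000_000_000])

# The outlier dominates the mean but leaves the median untouched
print('mean =', int(np.mean(prices)), ' median =', int(np.median(prices)))
# expected: mean = 800360000  median = 500000
```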
Now, let's take a look at the distribution price of housing sales for each year.
I write a function to generate all the histograms I want.
def makeHistPlots(data, time_period, start, stop, step):
    ncols = 2
    nrows = int(np.ceil(len(time_period) / (1.0 * ncols)))
    fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(30, 20))
    bin_values = np.arange(start=start, stop=stop, step=step)
    counter = list(time_period)[0]
    for i in range(nrows):
        for j in range(ncols):
            ax = axes[i][j]
            # Plot when we have data
            if counter in time_period:
                ax.hist(data[data['year'] == counter]['SALE PRICE'], bins=bin_values, color='blue', alpha=0.5, label='{}'.format(counter))
                ax.set_xlabel('Price')
                ax.set_ylabel('')
                # ax.set_ylim([0, 5])
                leg = ax.legend(loc='upper right')
                leg.draw_frame(False)
            # Remove axis when we no longer have data
            else:
                ax.set_axis_off()
            counter += 1
    plt.show()
I'm going to look at home values up to $4,500,000 (the 95th percentile).
makeHistPlots(data=mn_sales, time_period = range(2003,2016),start=0, stop=4500000, step=100000)
There is a high number of houses sold for under $100K. This is highly unusual, if not impossible, for Manhattan. This should be one of the factors contributing to the large difference between the mean and median values.
Let's take a closer look at the values under $100k.
print('Out of a total of {} records valued under $100K, {} have a sale price of $0.'.format(sum(mn_sales['SALE PRICE'] <= 100000), sum(mn_sales['SALE PRICE'] == 0)))
data = mn_sales[mn_sales['SALE PRICE'] <= 100000]
bin_values = np.arange(start=0, stop=100000, step=5000)
data['SALE PRICE'].hist(bins=bin_values, figsize=[14,6], grid=False)
So we can see there are numerous records with values under $10k.
I'm going to remove all the records with values under $100k and get the basic statistics and plot the histograms again.
data = mn_sales[mn_sales['SALE PRICE'] >= 100000]
data['SALE PRICE'].describe()
print('The 95th percentile is', data['SALE PRICE'].quantile(0.95))
I'm going to look at home values up to $6,200,000 (the 95th percentile).
data = mn_sales[mn_sales['SALE PRICE'] >= 100000]
makeHistPlots(data=data, time_period = range(2003,2016),start=0, stop=6200000, step=100000)
There is a sharp drop in the number of homes sold for 1 to 1.1 million dollars. One reason could be that sellers would rather list a home just under 1 million dollars, where it sells more easily, than at, for example, 1.05 million dollars.
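This drop can be quantified by counting sales in the $100K bands on either side of the $1M mark. A minimal sketch (the `prices` series below is illustrative, standing in for `mn_sales['SALE PRICE']`):

```python
import pandas as pd

# Illustrative sale prices clustered just below the $1M mark
prices = pd.Series([950_000, 975_000, 995_000, 999_000, 1_020_000, 1_450_000])

# Count sales in the band just below vs just above $1M (between is inclusive)
below = int(prices.between(900_000, 999_999).sum())
above = int(prices.between(1_000_000, 1_099_999).sum())
print(below, above)  # expected: 4 1
```

Comparing these two counts on the real data would make the "list just under a round number" effect concrete.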
Let's see how the monthly median sale price changes over time.
plt.figure(figsize=(35,10))
data = mn_sales[mn_sales['SALE PRICE'] >= 100000]
data = data.groupby(['year','month']).median().reset_index()
x = data['year'].map(str) + ' ' + data['month'].map(str)
y = data['SALE PRICE']
plt.xticks(rotation=90)
# plt.fill_between(x, y, color='#539ecd')
plt.plot(x,y,linewidth=3)
# sns.lineplot(x,y)
Let's get the median price of homes in each neighborhood for the whole period and plot it as a bar plot.
plt.figure(figsize=(35,10))
data = mn_sales[mn_sales['SALE PRICE'] >= 100000]
mn_median = data['SALE PRICE'].median()
data = data.groupby(['NEIGHBORHOOD']).median().reset_index().sort_values(by=['SALE PRICE'],ascending=False)
x = data['NEIGHBORHOOD']
y = data['SALE PRICE']
plt.xticks(rotation=90)
sns.barplot(x=x, y=y)
# plt.bar(x,y)
plt.axhline(y=mn_median,linewidth=3)
The 'MANHATTAN-UNKNOWN' neighborhood has a very high median. Let's examine it more closely.
data = mn_sales[mn_sales['SALE PRICE'] >= 100000]
data[data['NEIGHBORHOOD']== 'MANHATTAN-UNKNOWN'].loc[:,'SALE PRICE']
There are only 11 sales in this category. I'm going to remove it from my visualization.
plt.figure(figsize=(35,10))
data = mn_sales[mn_sales['SALE PRICE'] >= 100000]
data = data[data['NEIGHBORHOOD'] != 'MANHATTAN-UNKNOWN']
mn_median = data['SALE PRICE'].median()
data = data.groupby(['NEIGHBORHOOD']).median().reset_index().sort_values(by=['SALE PRICE'],ascending=False)
x = data['NEIGHBORHOOD']
y = data['SALE PRICE']
plt.xticks(rotation=90)
sns.barplot(x=x, y=y, linewidth=3)
# plt.bar(x,y)
plt.axhline(y=mn_median)
Let's also take a look at the same graph for the last year of data.
plt.figure(figsize=(35,10))
data = mn_sales[mn_sales['SALE PRICE'] >= 100000]
data = data[data['year'] == 2015]
data = data[data['NEIGHBORHOOD'] != 'MANHATTAN-UNKNOWN']
mn_median = data['SALE PRICE'].median()
data = data.groupby(['NEIGHBORHOOD']).median().reset_index().sort_values(by=['SALE PRICE'],ascending=False)
x = data['NEIGHBORHOOD']
y = data['SALE PRICE']
plt.xticks(rotation=90)
sns.barplot(x=x, y=y)
# plt.bar(x,y)
plt.axhline(y=mn_median)
plt.savefig('../plots/2015_median.png')
Let's take a quick look at the "FASHION" neighborhood. There isn't actually a neighborhood with this exact name in Manhattan.
data = mn_sales[mn_sales['SALE PRICE'] >= 100000]
data[data['NEIGHBORHOOD'] == "FASHION"].count()
The addresses belong to the Midtown West area of Manhattan. Let's see how the sales break down by month.
def neighborhoodVsWholeGraph(data, neighborhood):
    data = mn_sales[mn_sales['SALE PRICE'] >= 100000]
    data = data[data['NEIGHBORHOOD'] == neighborhood].groupby(['year', 'month']).median().reset_index()
    x = data['year'].map(str) + '_' + data['month'].map(str)
    y = data['SALE PRICE']
    data_whole = mn_sales[mn_sales['SALE PRICE'] >= 100000]
    data_whole = data_whole.groupby(['year', 'month']).median().reset_index()
    x2 = data_whole['year'].map(str) + '_' + data_whole['month'].map(str)
    y2 = data_whole['SALE PRICE']
    fig, ax = plt.subplots(1, 1, figsize=(35, 10))
    ax.plot(x, y, linewidth=3, label=neighborhood)
    ax.set_xticks(x[::3])
    ax.set_xticklabels(x[::3], rotation=70, fontsize=14)
    ax.set_yticklabels(ax.get_yticks(), fontsize=16)
    ax.plot(x2, y2, linewidth=3, label='Manhattan')
    ax.spines['right'].set_visible(False)
    ax.spines['top'].set_visible(False)
    ax.set_ylabel('Price', fontsize=30)
    ax.set_xlabel('Time', fontsize=30)
    ax.xaxis.get_label().set_style('italic')
    ax.yaxis.get_label().set_style('italic')
    ax.legend(loc='upper left', prop={'size': 20}, frameon=False)
    # plt.show()
    plt.savefig('../plots/neighborhoodVsWholeGraph/{}VsWholeGraph.png'.format(neighborhood))
# plt.figure(figsize=(35,10))
# plt.title('Median Monthly Housing Sale Price',fontsize=24)
# plt.ylabel('Price', fontsize=20)
# plt.xlabel('Time', fontsize=20)
# plt.xticks(rotation=90)
# plt.plot(x,y,linewidth=3, label= neighborhood)
# plt.plot(x2,y2,linewidth=3, label= 'Manhattan')
# plt.legend(loc='upper left', prop={'size': 20})
listN = mn_sales['NEIGHBORHOOD'].unique()
for neighborhood in listN:
    neighborhoodVsWholeGraph(mn_sales, neighborhood)
neighborhoodVsWholeGraph(mn_sales,'CHELSEA')
Let's look at the percent change of the median monthly housing price for each neighborhood, from the first month to the last, and see which neighborhoods grew the most.
data = mn_sales[mn_sales['SALE PRICE'] >= 100000]
data = data.groupby(['NEIGHBORHOOD','year','month']).median().reset_index()
data = data.groupby('NEIGHBORHOOD')['SALE PRICE'].apply(lambda x: 100*(x.iloc[-1] - x.iloc[0])/x.iloc[0]).reset_index()
data.sort_values(by='SALE PRICE', inplace=True, ascending=False)
plt.figure(figsize=(35,10))
x = data['NEIGHBORHOOD']
y = data['SALE PRICE']
plt.xticks(rotation=90)
sns.barplot(x=x, y=y, linewidth=3)
# plt.bar(x,y)
plt.savefig('../plots/perChange.png')
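The per-neighborhood percent-change calculation above can be sketched on toy data to see exactly what the `groupby`/`apply` lambda computes (the neighborhood names and prices here are made up):

```python
import pandas as pd

# Two toy neighborhoods, each with a first and a last monthly median price
df = pd.DataFrame({
    'NEIGHBORHOOD': ['A', 'A', 'B', 'B'],
    'SALE PRICE': [100.0, 150.0, 200.0, 180.0],
})

# Percent change from the first to the last observation within each group
growth = df.groupby('NEIGHBORHOOD')['SALE PRICE'].apply(
    lambda x: 100 * (x.iloc[-1] - x.iloc[0]) / x.iloc[0])
print(growth.to_dict())  # expected: {'A': 50.0, 'B': -10.0}
```

Note this measures only the endpoints; a regression slope over all months would be less sensitive to noisy first or last values.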
I'm going to compare the neighborhood names from the clustering section with the neighborhoods in the housing sales data.
mn_data['Neighborhood'].str.lower().sort_values()
pd.Series(mn_sales['NEIGHBORHOOD'].unique())
So the naming conventions are quite different.
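One way to reconcile the two naming schemes is stdlib fuzzy matching with `difflib`. A hedged sketch: the name lists below are illustrative stand-ins for `mn_data['Neighborhood']` and `mn_sales['NEIGHBORHOOD']`, and a real mapping would still need manual review.

```python
import difflib

# Illustrative names from the two sources (lowercased for comparison)
clustering_names = ['chelsea', 'upper east side', 'midtown']
sales_names = ['CHELSEA', 'UPPER EAST SIDE (59-79)', 'MIDTOWN WEST']

# Best fuzzy match (if any) for each clustering-section name
mapping = {
    name: difflib.get_close_matches(name, [s.lower() for s in sales_names],
                                    n=1, cutoff=0.6)
    for name in clustering_names
}
print(mapping['chelsea'])  # expected: ['chelsea']
```

Names with no match above the cutoff come back as empty lists, which flags the cases that need a hand-built lookup table.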
LSTM
# mn_sales_processed = mn_sales_mean_neigh.iloc[:, 4:]
# print(mn_sales_processed.shape)
# mn_sales_processed = mn_sales_processed[mn_sales_processed['SALE PRICE'] >= 100000]
# print(mn_sales_processed.shape)
# scaler = MinMaxScaler(feature_range = (0, 1))
# mn_sales_scaled = scaler.fit_transform(mn_sales_processed)
# features_set = []
# labels = []
# for i in range(8, 130):
# features_set.append(mn_sales_scaled[i-8:i, 0])
# labels.append(mn_sales_scaled[i, 0])
# features_set, labels = np.array(features_set), np.array(labels)
# features_set = np.reshape(features_set, (features_set.shape[0], features_set.shape[1], 1))
# model = Sequential()
# model.add(LSTM(units=50, return_sequences=True, input_shape=(features_set.shape[1], 1)))
# model.add(Dropout(0.2))
# model.add(LSTM(units=50, return_sequences=True))
# model.add(Dropout(0.2))
# model.add(LSTM(units=50, return_sequences=True))
# model.add(Dropout(0.2))
# model.add(LSTM(units=50))
# model.add(Dropout(0.2))
# model.add(Dense(units = 1))
# model.compile(optimizer = 'adam', loss = 'mean_squared_error')
# model.fit(features_set, labels, epochs = 100, batch_size = 32)
# test_inputs = mn_sales_processed[len(mn_sales_processed) - (156-130)-8:].values
# test_inputs = test_inputs.reshape(-1,1)
# test_inputs = scaler.transform(test_inputs)
# test_features = []
# for i in range(8, 34):
# test_features.append(test_inputs[i-8:i, 0])
# test_features = np.array(test_features)
# test_features = np.reshape(test_features, (test_features.shape[0], test_features.shape[1], 1))
# predictions = model.predict(test_features)
# predictions = scaler.inverse_transform(predictions)
# plt.figure(figsize=(10,6))
# plt.plot(mn_sales_processed[130:].values, color='blue', label='Actual Sales Price')
# plt.plot(predictions , color='red', label='Predicted Sales Price')
# plt.title('House Sales Price Prediction')
# plt.xlabel('Date')
# plt.ylabel('House Sales Price')
# plt.legend()
# plt.show()